EU_Social_Progress_Index_2024
Source: European Commission https://commission.europa.eu/index_en
!pip install mapclassify
!pip install geopandas
import pandas as pd
import numpy as np
import seaborn as sns
import statsmodels.api as sm
import matplotlib.pyplot as plt
import requests
import datetime
import geopandas as gpd
import warnings
warnings.filterwarnings("ignore")
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
Reading the dataset
path = 'Datasets/EU-SPI 2.0_2024_raw_data.xlsx'
df = pd.read_excel(path)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 242 entries, 0 to 241
Data columns (total 48 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Country  242 non-null  object
 1   NUTS code  242 non-null  object
 2   RegionName  242 non-null  object
 3   Infant mortality  242 non-null  float64
 4   Satisfaction with water quality  236 non-null  float64
 5   Uncollected sewage  242 non-null  float64
 6   Sewage treatment, additional  242 non-null  float64
 7   Safety at night  236 non-null  float64
 8   Money Stolen  236 non-null  float64
 9   Assaulted/Mugged  236 non-null  float64
 10  Traffic deaths  242 non-null  int64
 11  Share of low-achieving 15 year olds in reading (level 1a or lower)  223 non-null  float64
 12  Share of low-achieving 15 year olds in maths and science  242 non-null  float64
 13  Lower-secondary completion only  241 non-null  float64
 14  Early school leavers  237 non-null  float64
 15  Broadband at home  241 non-null  float64
 16  Digital skills above basic level  236 non-null  float64
 17  Online interaction with public authorities  241 non-null  float64
 18  Internet access  236 non-null  float64
 19  Freedom of media  236 non-null  float64
 20  Subjective health status  236 non-null  float64
 21  Standardised cancer death rate  242 non-null  float64
 22  Standardised heart disease death rate  242 non-null  float64
 23  Years of life lost due to air pollution  234 non-null  float64
 24  Index of positive emotions  236 non-null  float64
 25  Air pollution NO2  234 non-null  float64
 26  Air pollution Ozone (SOMO35)  234 non-null  float64
 27  Air pollution pm2.5  234 non-null  float64
 28  Bathing water quality  216 non-null  float64
 29  Trust in the national government  236 non-null  float64
 30  Trust in the legal system  236 non-null  float64
 31  Trust in the police  236 non-null  float64
 32  Voiced opinion to public official  236 non-null  float64
 33  Female participation in regional assemblies  242 non-null  float64
 34  Institution quality index  240 non-null  float64
 35  Freedom over life choices  236 non-null  float64
 36  Job opportunities  236 non-null  float64
 37  Teenage pregnancy  242 non-null  float64
 38  Young people not in education, employment or training (NEET)  238 non-null  float64
 39  Institutions corruption index (EQI)  240 non-null  float64
 40  Institution impartiality index (EQI)  240 non-null  float64
 41  Tolerance towards immigrants  236 non-null  float64
 42  Tolerance towards minorities  236 non-null  float64
 43  Tolerance towards gay or lesbian people  236 non-null  float64
 44  Women treated with respect  236 non-null  float64
 45  Tertiary education attainment  242 non-null  float64
 46  Lifelong learning  241 non-null  float64
 47  Academic citations per 1000 persons  238 non-null  float64
dtypes: float64(44), int64(1), object(3)
memory usage: 90.9+ KB
Reading the spatial data
Source: Eurostat
https://ec.europa.eu/eurostat/web/gisco/geodata/statistical-units/territorial-units-statistics
# Load the shapefile of the NUTS regions
shapefile_path = 'NUTS_RG_20M_2021_3035.shp'
gdf = gpd.read_file(shapefile_path)
gdf.info()
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 2010 entries, 0 to 2009
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   NUTS_ID  2010 non-null  object
 1   LEVL_CODE  2010 non-null  int32
 2   CNTR_CODE  2010 non-null  object
 3   NAME_LATN  2010 non-null  object
 4   NUTS_NAME  2010 non-null  object
 5   MOUNT_TYPE  2009 non-null  float64
 6   URBN_TYPE  2010 non-null  int32
 7   COAST_TYPE  2010 non-null  int32
 8   FID  2010 non-null  object
 9   geometry  2010 non-null  geometry
dtypes: float64(1), geometry(1), int32(3), object(5)
memory usage: 133.6+ KB
GDP
- Gross domestic product per capita at current market prices by NUTS 2 regions [nama_10r_2gdp]
https://ec.europa.eu/eurostat/databrowser/view/nama_10r_2gdp/default/table?lang=en
Unit of measure [UNIT]: Euro per inhabitant [EUR_HAB]
path = 'Datasets/gdp_pp.xlsx'
gdp_pp = pd.read_excel(path)
gdp_pp.head()
| | GEO (Codes) | GEO (Labels) | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | EU27_2020 | European Union - 27 countries (from 2020) | 18400 | 19200 | 19900 | 20300 | 21200 | 22000 | 23200 | 24600 | 25300 | 24100 | 24900 | 25700 | 25800 | 26000 | 26600 | 27500 | 28200 | 29300 | 30300 | 31300 | 30100 | 32700 | 35400 |
| 1 | BE | Belgium | 25000 | 25700 | 26400 | 27100 | 28500 | 29600 | 30800 | 32300 | 32800 | 32100 | 33300 | 34100 | 34800 | 35200 | 36000 | 37000 | 38000 | 39100 | 40300 | 41700 | 39900 | 43800 | 47400 |
| 2 | BE1 | Région de Bruxelles-Capitale/Brussels Hoofdste... | : | : | : | 54900 | 57400 | 58700 | 60600 | 62200 | 61800 | 60900 | 62500 | 62200 | 63300 | 63400 | 64500 | 66100 | 66700 | 68400 | 69600 | 71700 | 68300 | 73400 | 77800 |
| 3 | BE10 | Région de Bruxelles-Capitale/Brussels Hoofdste... | : | : | : | 54900 | 57400 | 58700 | 60600 | 62200 | 61800 | 60900 | 62500 | 62200 | 63300 | 63400 | 64500 | 66100 | 66700 | 68400 | 69600 | 71700 | 68300 | 73400 | 77800 |
| 4 | BE2 | Vlaams Gewest | : | : | : | 26800 | 28100 | 29300 | 30700 | 32400 | 32900 | 32000 | 33200 | 34100 | 34900 | 35400 | 36200 | 37400 | 38600 | 39800 | 40900 | 42200 | 40600 | 45200 | 49000 |
gdp_pp.replace(':', np.nan, inplace=True)
gdp_pp_data = gdp_pp[['GEO (Codes)','2022']]
gdp_pp_data = gdp_pp_data.copy()
gdp_pp_data.rename(columns={'2022': 'gdp_per_capita_2022'}, inplace=True)
df = pd.merge(df, gdp_pp_data, how='left', left_on='NUTS code', right_on='GEO (Codes)')
df = df.drop(columns = 'GEO (Codes)')
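A left merge on region codes silently leaves NaN wherever a code has no match, so it can be worth checking the join. The sketch below is not part of the original notebook: it uses small made-up tables (`spi`, `gdp`) in place of `df` and `gdp_pp_data`, and relies on the `validate` and `indicator` parameters of pandas' `merge`.

```python
import pandas as pd

# Toy stand-ins for the notebook's tables (values are made up for illustration).
spi = pd.DataFrame({"NUTS code": ["BE10", "BE21", "XX99"],
                    "Infant mortality": [3.1, 2.9, 4.0]})
gdp = pd.DataFrame({"GEO (Codes)": ["BE10", "BE21", "BE22"],
                    "2022": [77800, ":", 31000]})

# Eurostat marks unavailable values with ':'; coerce them to NaN
# and force a numeric dtype in one step.
gdp["2022"] = pd.to_numeric(gdp["2022"], errors="coerce")
gdp = gdp.rename(columns={"2022": "gdp_per_capita_2022"})

# validate="m:1" raises MergeError if a code is duplicated on the right side;
# indicator=True adds a "_merge" column that flags unmatched rows.
merged = spi.merge(gdp, how="left", left_on="NUTS code", right_on="GEO (Codes)",
                   validate="m:1", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "NUTS code"].tolist()
merged = merged.drop(columns=["GEO (Codes)", "_merge"])

print("Codes without a GDP match:", unmatched)
print("Rows without GDP:", merged["gdp_per_capita_2022"].isna().sum())
```

In this toy case one code has no match and one matched region has an unavailable value, so two rows end up without GDP. The same check on the real tables would distinguish those two situations.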
Exploratory Analysis
The dataset contains:
- 242 rows, one per EU region
- 49 columns: 1 is GDP per capita, 3 hold geographic information (country, region name, and NUTS code, the standard regional identifier), and the remaining 45 are indicators
df.shape
(242, 49)
The data types appear to be correct.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 242 entries, 0 to 241
Data columns (total 49 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Country  242 non-null  object
 1   NUTS code  242 non-null  object
 2   RegionName  242 non-null  object
 3   Infant mortality  242 non-null  float64
 4   Satisfaction with water quality  236 non-null  float64
 5   Uncollected sewage  242 non-null  float64
 6   Sewage treatment, additional  242 non-null  float64
 7   Safety at night  236 non-null  float64
 8   Money Stolen  236 non-null  float64
 9   Assaulted/Mugged  236 non-null  float64
 10  Traffic deaths  242 non-null  int64
 11  Share of low-achieving 15 year olds in reading (level 1a or lower)  223 non-null  float64
 12  Share of low-achieving 15 year olds in maths and science  242 non-null  float64
 13  Lower-secondary completion only  241 non-null  float64
 14  Early school leavers  237 non-null  float64
 15  Broadband at home  241 non-null  float64
 16  Digital skills above basic level  236 non-null  float64
 17  Online interaction with public authorities  241 non-null  float64
 18  Internet access  236 non-null  float64
 19  Freedom of media  236 non-null  float64
 20  Subjective health status  236 non-null  float64
 21  Standardised cancer death rate  242 non-null  float64
 22  Standardised heart disease death rate  242 non-null  float64
 23  Years of life lost due to air pollution  234 non-null  float64
 24  Index of positive emotions  236 non-null  float64
 25  Air pollution NO2  234 non-null  float64
 26  Air pollution Ozone (SOMO35)  234 non-null  float64
 27  Air pollution pm2.5  234 non-null  float64
 28  Bathing water quality  216 non-null  float64
 29  Trust in the national government  236 non-null  float64
 30  Trust in the legal system  236 non-null  float64
 31  Trust in the police  236 non-null  float64
 32  Voiced opinion to public official  236 non-null  float64
 33  Female participation in regional assemblies  242 non-null  float64
 34  Institution quality index  240 non-null  float64
 35  Freedom over life choices  236 non-null  float64
 36  Job opportunities  236 non-null  float64
 37  Teenage pregnancy  242 non-null  float64
 38  Young people not in education, employment or training (NEET)  238 non-null  float64
 39  Institutions corruption index (EQI)  240 non-null  float64
 40  Institution impartiality index (EQI)  240 non-null  float64
 41  Tolerance towards immigrants  236 non-null  float64
 42  Tolerance towards minorities  236 non-null  float64
 43  Tolerance towards gay or lesbian people  236 non-null  float64
 44  Women treated with respect  236 non-null  float64
 45  Tertiary education attainment  242 non-null  float64
 46  Lifelong learning  241 non-null  float64
 47  Academic citations per 1000 persons  238 non-null  float64
 48  gdp_per_capita_2022  242 non-null  float64
dtypes: float64(45), int64(1), object(3)
memory usage: 92.8+ KB
Missing data
The categorical variables have no missing values.
Some of the indicators have missing values.
Overall the share of missing data is small, no more than 3.5% per variable, except for:
- Bathing water quality: 10.7%
- Share of low-achieving 15 year olds in reading (level 1a or lower): 7.9%
[Figure 00.1]
# Count missing values per column and express them as a percentage of all rows
nan_count = df.isnull().sum()
nan_percentage = round((nan_count / len(df)) * 100, 2)
nan_summary = pd.DataFrame({
    'indicador': nan_count.index,
    'missing': nan_count.values,
    'proporcion': nan_percentage.values
})
# Keep only the columns that actually have missing values
nan_summary = nan_summary[nan_summary['missing'] > 0]
print("Figure 00.1")
print(nan_summary)
Figure 00.1
indicador missing proporcion
4 Satisfaction with water quality 6 2.48
7 Safety at night 6 2.48
8 Money Stolen 6 2.48
9 Assaulted/Mugged 6 2.48
11 Share of low-achieving 15 year olds in reading... 19 7.85
13 Lower-secondary completion only 1 0.41
14 Early school leavers 5 2.07
15 Broadband at home 1 0.41
16 Digital skills above basic level 6 2.48
17 Online interaction with public authorities 1 0.41
18 Internet access 6 2.48
19 Freedom of media 6 2.48
20 Subjective health status 6 2.48
23 Years of life lost due to air pollution 8 3.31
24 Index of positive emotions 6 2.48
25 Air pollution NO2 8 3.31
26 Air pollution Ozone (SOMO35) 8 3.31
27 Air pollution pm2.5 8 3.31
28 Bathing water quality 26 10.74
29 Trust in the national government 6 2.48
30 Trust in the legal system 6 2.48
31 Trust in the police 6 2.48
32 Voiced opinion to public official 6 2.48
34 Institution quality index 2 0.83
35 Freedom over life choices 6 2.48
36 Job opportunities 6 2.48
38 Young people not in education, employment or t... 4 1.65
39 Institutions corruption index (EQI) 2 0.83
40 Institution impartiality index (EQI) 2 0.83
41 Tolerance towards immigrants 6 2.48
42 Tolerance towards minorities 6 2.48
43 Tolerance towards gay or lesbian people 6 2.48
44 Women treated with respect 6 2.48
46 Lifelong learning 1 0.41
47 Academic citations per 1000 persons 4 1.65
filtered_df = df[['Country', 'Bathing water quality', 'Share of low-achieving 15 year olds in reading (level 1a or lower)']]
grouped_df = filtered_df.groupby('Country')
nan_summary_list = []
for name, group in grouped_df:
    nan_count = group.isnull().sum()
    nan_percentage = round((nan_count / len(group)) * 100, 2)
    nan_summary = pd.DataFrame({
        'Country': name,
        'indicador': nan_count.index,
        'missing': nan_count.values,
        'proporcion': nan_percentage.values
    })
    nan_summary_list.append(nan_summary)
nan_summary_final = pd.concat(nan_summary_list)
nan_summary_final = nan_summary_final[nan_summary_final['missing'] > 0]
nan_summary_final
| | Country | indicador | missing | proporcion |
|---|---|---|---|---|
| 1 | BE | Bathing water quality | 4 | 36.36 |
| 1 | BG | Bathing water quality | 4 | 66.67 |
| 1 | CZ | Bathing water quality | 1 | 12.50 |
| 1 | DE | Bathing water quality | 2 | 5.26 |
| 1 | EL | Bathing water quality | 1 | 7.69 |
| 1 | ES | Bathing water quality | 1 | 5.26 |
| 2 | ES | Share of low-achieving 15 year olds in reading... | 19 | 100.00 |
| 1 | HR | Bathing water quality | 1 | 25.00 |
| 1 | HU | Bathing water quality | 1 | 12.50 |
| 1 | IT | Bathing water quality | 1 | 4.76 |
| 1 | RO | Bathing water quality | 7 | 87.50 |
| 1 | SE | Bathing water quality | 1 | 12.50 |
| 1 | SK | Bathing water quality | 2 | 50.00 |
The missing values of "Bathing water quality" are spread across several countries, whereas all 19 missing values of "Share of low-achieving 15 year olds in reading (level 1a or lower)" belong to Spain.
We confirm that the indicator "Share of low-achieving 15 year olds in reading (level 1a or lower)" has no data for any Spanish region:
df[df['Country']=="ES"]['Share of low-achieving 15 year olds in reading (level 1a or lower)']
92    NaN
93    NaN
94    NaN
95    NaN
96    NaN
97    NaN
98    NaN
99    NaN
100   NaN
101   NaN
102   NaN
103   NaN
104   NaN
105   NaN
106   NaN
107   NaN
108   NaN
109   NaN
110   NaN
Name: Share of low-achieving 15 year olds in reading (level 1a or lower), dtype: float64
We check whether any other indicator is entirely missing for some country:
[Figure 00.2]
missing_all_by_country = {}
# Get the list of unique countries
paises = df['Country'].unique()
# Iterate over each country
for pais in paises:
    # Filter the DataFrame to the current country
    df_pais = df[df['Country'] == pais]
    # Find the variables whose values are all missing for this country
    variables_con_todos_perdidos = df_pais.columns[df_pais.isnull().all()].tolist()
    # Store the result in the dictionary
    missing_all_by_country[pais] = variables_con_todos_perdidos
print("Figure 00.2")
print()
# Show the result
for pais, variables in missing_all_by_country.items():
    print(f"Country: {pais}, Variables with all values missing: {variables}")
Figure 00.2

Country: AT, Variables with all values missing: []
Country: BE, Variables with all values missing: []
Country: BG, Variables with all values missing: []
Country: CY, Variables with all values missing: ['Digital skills above basic level']
Country: CZ, Variables with all values missing: []
Country: DE, Variables with all values missing: []
Country: DK, Variables with all values missing: []
Country: EE, Variables with all values missing: ['Digital skills above basic level']
Country: EL, Variables with all values missing: []
Country: ES, Variables with all values missing: ['Share of low-achieving 15 year olds in reading (level 1a or lower)']
Country: FI, Variables with all values missing: []
Country: FR, Variables with all values missing: []
Country: HR, Variables with all values missing: []
Country: HU, Variables with all values missing: []
Country: IE, Variables with all values missing: []
Country: IT, Variables with all values missing: []
Country: LT, Variables with all values missing: []
Country: LU, Variables with all values missing: ['Digital skills above basic level']
Country: LV, Variables with all values missing: ['Digital skills above basic level']
Country: MT, Variables with all values missing: ['Digital skills above basic level']
Country: NL, Variables with all values missing: []
Country: PL, Variables with all values missing: []
Country: PT, Variables with all values missing: []
Country: RO, Variables with all values missing: []
Country: SE, Variables with all values missing: []
Country: SI, Variables with all values missing: []
Country: SK, Variables with all values missing: []
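The per-country check above can also be written without an explicit loop over countries. The following is only a sketch on made-up data (the frame `toy` and its columns `a` and `b` are hypothetical, not part of the notebook): it computes the share of missing values per country and indicator with a single `groupby`, then keeps the pairs where that share is 100%.

```python
import numpy as np
import pandas as pd

# Toy data (made up): indicator "a" is entirely missing for country "ES".
toy = pd.DataFrame({
    "Country": ["ES", "ES", "FR", "FR"],
    "a": [np.nan, np.nan, 1.0, np.nan],
    "b": [1.0, 2.0, 3.0, 4.0],
})

# Share of missing values per country and indicator (0.0 to 1.0):
# the mean of a boolean mask is the fraction of True values.
missing_share = toy.drop(columns="Country").isna().groupby(toy["Country"]).mean()

# For each country, list the indicators whose values are all missing;
# countries with no such indicator are left out.
fully_missing = missing_share.eq(1.0)
all_missing = {
    country: row[row].index.tolist()
    for country, row in fully_missing.iterrows()
    if row.any()
}
print(all_missing)
```

The intermediate `missing_share` table is useful on its own, since it also shows partially missing indicators (here, half of the values of `a` for `FR`).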
Descriptive statistics
[Figure 00.3]
round(df.describe(),2)
| | Infant mortality | Satisfaction with water quality | Uncollected sewage | Sewage treatment, additional | Safety at night | Money Stolen | Assaulted/Mugged | Traffic deaths | Share of low-achieving 15 year olds in reading (level 1a or lower) | Share of low-achieving 15 year olds in maths and science | Lower-secondary completion only | Early school leavers | Broadband at home | Digital skills above basic level | Online interaction with public authorities | Internet access | Freedom of media | Subjective health status | Standardised cancer death rate | Standardised heart disease death rate | Years of life lost due to air pollution | Index of positive emotions | Air pollution NO2 | Air pollution Ozone (SOMO35) | Air pollution pm2.5 | Bathing water quality | Trust in the national government | Trust in the legal system | Trust in the police | Voiced opinion to public official | Female participation in regional assemblies | Institution quality index | Freedom over life choices | Job opportunities | Teenage pregnancy | Young people not in education, employment or training (NEET) | Institutions corruption index (EQI) | Institution impartiality index (EQI) | Tolerance towards immigrants | Tolerance towards minorities | Tolerance towards gay or lesbian people | Women treated with respect | Tertiary education attainment | Lifelong learning | Academic citations per 1000 persons | gdp_per_capita_2022 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 242.00 | 236.00 | 242.00 | 242.00 | 236.00 | 236.00 | 236.00 | 242.00 | 223.00 | 242.00 | 241.00 | 237.00 | 241.00 | 236.00 | 241.00 | 236.00 | 236.00 | 236.00 | 242.00 | 242.00 | 234.00 | 236.00 | 234.00 | 234.00 | 234.00 | 216.00 | 236.00 | 236.00 | 236.00 | 236.00 | 242.00 | 240.00 | 236.00 | 236.00 | 242.00 | 238.00 | 240.00 | 240.00 | 236.00 | 236.00 | 236.00 | 236.00 | 242.00 | 241.00 | 238.00 | 242.00 |
| mean | 3.21 | 82.34 | 2.40 | 84.30 | 75.53 | 7.98 | 3.12 | 48.38 | 22.91 | 45.24 | 19.70 | 9.56 | 89.49 | 26.74 | 60.05 | 89.46 | 72.18 | 68.13 | 78.71 | 47.12 | 552.96 | 71.68 | 12.92 | 4182.70 | 10.90 | 0.78 | 44.27 | 56.53 | 79.14 | 23.58 | 33.09 | 0.10 | 80.04 | 48.59 | 8.19 | 10.33 | 0.13 | 0.11 | 71.46 | 75.72 | 64.96 | 71.18 | 33.27 | 12.09 | 3.76 | 33784.71 |
| std | 1.45 | 9.38 | 7.91 | 23.34 | 7.63 | 2.75 | 1.82 | 24.06 | 6.98 | 9.01 | 11.43 | 4.45 | 5.01 | 10.00 | 20.13 | 7.21 | 16.62 | 7.44 | 18.71 | 32.13 | 404.43 | 4.67 | 4.39 | 1418.50 | 3.58 | 0.22 | 14.61 | 16.04 | 8.51 | 7.36 | 12.07 | 0.98 | 10.01 | 17.32 | 10.73 | 4.94 | 1.00 | 0.98 | 14.00 | 10.89 | 20.63 | 11.39 | 10.32 | 7.60 | 3.48 | 17842.51 |
| min | 0.00 | 43.41 | 0.00 | 0.00 | 55.53 | 1.84 | 0.57 | 0.00 | 11.05 | 31.01 | 1.50 | 1.43 | 73.50 | 6.88 | 10.46 | 62.33 | 28.85 | 25.22 | 46.32 | 18.02 | 0.00 | 56.10 | 0.30 | 1524.81 | 3.64 | 0.05 | 6.47 | 9.93 | 46.14 | 9.29 | 4.00 | -2.70 | 48.08 | 11.30 | 1.04 | 2.73 | -2.56 | -2.45 | 24.85 | 44.58 | 9.17 | 34.53 | 13.70 | 0.90 | 0.00 | 8500.00 |
| 25% | 2.30 | 76.93 | 0.00 | 79.33 | 71.18 | 6.12 | 1.55 | 32.00 | 20.69 | 40.67 | 12.30 | 6.33 | 86.72 | 20.33 | 45.86 | 87.34 | 64.42 | 64.31 | 66.58 | 26.10 | 267.95 | 68.01 | 10.51 | 3129.04 | 8.59 | 0.70 | 35.07 | 44.73 | 76.81 | 17.71 | 23.85 | -0.70 | 75.26 | 35.17 | 3.22 | 6.99 | -0.67 | -0.72 | 64.78 | 69.47 | 52.16 | 62.81 | 25.88 | 7.20 | 1.25 | 20375.00 |
| 50% | 3.00 | 83.31 | 0.00 | 95.71 | 75.89 | 7.83 | 2.82 | 43.00 | 20.94 | 42.36 | 17.50 | 8.60 | 89.69 | 23.37 | 61.84 | 90.74 | 75.18 | 68.55 | 74.85 | 34.30 | 405.40 | 72.81 | 12.59 | 3990.71 | 10.00 | 0.85 | 43.12 | 55.56 | 81.41 | 23.21 | 31.93 | 0.26 | 82.85 | 48.24 | 5.71 | 9.23 | -0.03 | 0.22 | 74.07 | 78.94 | 70.35 | 73.08 | 32.30 | 10.60 | 2.80 | 32700.00 |
| 75% | 3.70 | 88.85 | 0.00 | 99.46 | 80.92 | 9.86 | 4.22 | 60.00 | 23.85 | 49.14 | 25.70 | 12.10 | 92.49 | 32.76 | 76.56 | 94.39 | 84.57 | 73.18 | 86.49 | 54.03 | 781.77 | 74.66 | 15.50 | 4922.12 | 12.98 | 0.93 | 55.98 | 70.66 | 84.14 | 29.25 | 45.02 | 0.88 | 87.19 | 61.59 | 7.34 | 12.79 | 1.03 | 0.93 | 80.92 | 83.22 | 79.48 | 79.80 | 40.65 | 14.50 | 5.37 | 44125.00 |
| max | 9.70 | 99.91 | 55.56 | 100.00 | 92.74 | 15.13 | 8.14 | 159.00 | 47.10 | 71.05 | 58.70 | 26.03 | 100.00 | 52.53 | 94.35 | 99.73 | 98.09 | 81.20 | 147.60 | 179.74 | 1871.59 | 81.61 | 31.18 | 8620.22 | 23.24 | 1.00 | 84.88 | 91.16 | 92.30 | 46.92 | 53.66 | 2.49 | 96.67 | 86.56 | 71.78 | 28.63 | 2.14 | 2.81 | 96.28 | 90.67 | 94.99 | 93.38 | 62.10 | 38.10 | 21.27 | 120300.00 |
Outliers
Outliers with Z-score
The Z-score measures how many standard deviations a data point lies above or below the mean of its distribution:
$$Z = \frac{X - \mu}{\sigma}$$
where X is the observed value, μ is the mean, and σ is the standard deviation of the variable.
A Z-score above 3 or below -3 is generally considered an outlier.
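As a small worked illustration of this rule, the sketch below applies it to a made-up column that also contains a missing value. It is not the notebook's code: it computes the Z-score directly with pandas, whose `mean` and `std` skip NaN by default, so a missing entry does not invalidate the rest of the column. For comparison, `scipy.stats.zscore` propagates NaN by default and offers `nan_policy='omit'` for similar behaviour.

```python
import numpy as np
import pandas as pd

# Toy column (made up): 19 typical values, one extreme value, one missing value.
values = pd.Series([10.0] * 10 + [12.0] * 9 + [100.0, np.nan])

# Z = (X - mean) / std. pandas ignores NaN when computing mean and std;
# ddof=0 gives the population standard deviation, the same convention
# scipy.stats.zscore uses by default.
z = (values - values.mean()) / values.std(ddof=0)

# Flag values more than 3 standard deviations from the mean.
# Comparisons with NaN are False, so the missing entry is never flagged.
outlier_mask = z.abs() > 3
print(values[outlier_mask].tolist())
```

Here the mean of the 20 observed values is 15.4 and the standard deviation is about 19.4, so the value 100 has a Z-score of roughly 4.35 and is the only one flagged.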
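[Figure 00.4]
from scipy import stats
# Select the numeric columns
num_cols = df.select_dtypes(include=['float64', 'int64']).columns
# Note: with the default nan_policy='propagate', a column that contains any NaN
# gets NaN Z-scores for all its rows, so it can never exceed the threshold below.
z_scores = np.abs(stats.zscore(df[num_cols]))
# A row is flagged if any of its indicators has |Z| > 3
outliers = (z_scores > 3).any(axis=1)
outliers_df = df[outliers]
print("Figure 00.4")
print()
print(outliers_df)
Figure 00.4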
Country NUTS code RegionName Infant mortality \
20 BG BG31 Severozapaden 7.1
21 BG BG32 Severen tsentralen 6.4
22 BG BG33 Severoiztochen 5.8
23 BG BG34 Yugoiztochen 9.6
24 BG BG41 Yugozapaden 3.1
25 BG BG42 Yuzhen tsentralen 5.1
73 DK DK01 Hovedstaden 2.9
138 FR FRY1 Guadeloupe 9.7
139 FR FRY2 Martinique 9.1
140 FR FRY3 Guyane 8.0
142 FR FRY5 Mayotte 8.8
144 HR HR03 Jadranska Hrvatska 4.2
145 HR HR05 Grad Zagreb 2.2
149 HU HU21 Közép-Dunántúl 4.2
151 HU HU23 Dél-Dunántúl 2.9
152 HU HU31 Észak-Magyarország 4.7
153 HU HU32 Észak-Alföld 4.5
154 HU HU33 Dél-Alföld 3.2
156 IE IE05 Southern 3.7
157 IE IE06 Eastern and Midland 2.6
160 IT ITC3 Liguria 2.8
168 IT ITG1 Sicilia 3.4
181 LU LU00 Luxembourg 3.1
219 PT PT30 Região Autónoma da Madeira 3.4
220 RO RO11 Nord-Vest 5.8
221 RO RO12 Centru 6.2
222 RO RO21 Nord-Est 5.7
223 RO RO22 Sud-Est 6.2
224 RO RO31 Sud-Muntenia 4.9
226 RO RO41 Sud-Vest Oltenia 6.1
241 SK SK04 Východné Slovensko 8.1
Satisfaction with water quality Uncollected sewage \
20 66.030690 13.112778
21 65.559160 7.750597
22 53.167801 2.871719
23 56.738884 4.071538
24 72.361886 3.876427
25 62.183077 10.397535
73 86.230570 0.000000
138 NaN 0.000000
139 NaN 0.000000
140 NaN 0.000000
142 NaN 38.083379
144 83.390794 11.764233
145 69.245381 0.000000
149 83.906151 0.000000
151 70.721171 0.000000
152 75.840807 0.000000
153 80.710030 0.000000
154 75.607066 0.000000
156 86.026341 0.000000
157 83.535296 0.000000
160 77.921639 0.000000
168 80.354638 2.906338
181 84.259418 0.000000
219 93.903471 3.223928
220 66.868325 31.978751
221 71.426739 35.952902
222 78.529510 38.579372
223 77.661334 30.262179
224 71.285612 55.555370
226 73.554553 55.233976
241 83.981749 1.208562
Sewage treatment, additional Safety at night Money Stolen \
20 56.147543 64.362238 4.394227
21 71.623448 56.013020 4.154489
22 89.369157 60.331069 6.455514
23 83.293837 67.406096 9.026132
24 74.180576 61.447177 7.172962
25 40.105975 75.855519 5.189070
73 100.000000 81.072249 7.802786
138 78.519761 NaN NaN
139 68.353602 NaN NaN
140 0.000000 NaN NaN
142 0.603897 NaN NaN
144 0.200850 85.271090 4.635883
145 0.000000 75.959677 10.980450
149 82.287945 77.602417 6.654045
151 63.942447 80.948219 5.441167
152 68.019629 68.852846 4.486461
153 82.317694 66.717524 10.920720
154 42.585622 72.990451 10.481930
156 50.097924 74.600090 6.100404
157 90.430568 77.087155 6.932770
160 10.338217 64.716392 9.629818
168 10.513516 71.112526 9.334710
181 93.047115 76.832755 7.638356
219 10.111170 80.023656 4.292296
220 60.475349 71.532229 8.228294
221 52.602769 60.536476 2.512792
222 50.390541 61.596737 8.532092
223 60.029871 70.520433 4.682273
224 27.383876 67.770895 7.784635
226 29.270395 66.802958 5.503103
241 86.116320 70.760150 9.250217
Assaulted/Mugged Traffic deaths \
20 2.793928 133
21 1.018728 103
22 1.767896 66
23 2.371754 80
24 3.370132 75
25 1.719668 61
73 2.560730 12
138 NaN 159
139 NaN 81
140 NaN 120
142 NaN 42
144 1.496991 92
145 7.228171 31
149 1.540083 75
151 2.656142 73
152 3.388809 56
153 2.439287 64
154 3.122959 69
156 5.415271 26
157 4.241856 23
160 5.164736 39
168 3.925364 33
181 0.678235 38
219 1.827482 47
220 2.492410 83
221 1.477218 76
222 2.838239 102
223 2.550227 116
224 4.933595 109
226 2.359531 112
241 1.212579 36
Share of low-achieving 15 year olds in reading (level 1a or lower) \
20 47.102802
21 47.102802
22 47.102802
23 47.102802
24 47.102802
25 47.102802
73 15.999534
138 20.937250
139 20.937250
140 20.937250
142 20.937250
144 21.578840
145 21.578840
149 25.273944
151 25.273944
152 25.273944
153 25.273944
154 25.273944
156 11.799349
157 11.799349
160 23.267173
168 23.267173
181 29.291415
219 20.221871
220 40.838355
221 40.838355
222 40.838355
223 40.838355
224 40.838355
226 40.838355
241 31.410025
Share of low-achieving 15 year olds in maths and science \
20 68.137605
21 68.137605
22 68.137605
23 68.137605
24 68.137605
25 68.137605
73 36.541697
138 42.357320
139 42.357320
140 42.357320
142 42.357320
144 58.539807
145 58.539807
149 49.274133
151 49.274133
152 49.274133
153 49.274133
154 49.274133
156 40.407878
157 40.407878
160 46.736349
168 46.736349
181 48.909676
219 44.178392
220 71.046676
221 71.046676
222 71.046676
223 71.046676
224 71.046676
226 71.046676
241 46.440243
Lower-secondary completion only Early school leavers Broadband at home \
20 18.0 17.633333 73.50
21 15.4 9.933333 82.67
22 21.7 13.266667 85.23
23 22.1 21.633333 81.73
24 7.8 6.333333 86.23
25 19.6 12.300000 85.53
73 13.8 8.300000 94.48
138 34.4 13.250000 75.37
139 29.1 13.350000 85.81
140 48.5 26.033333 79.16
142 NaN 14.300000 83.55
144 9.2 1.633333 87.38
145 5.3 2.700000 86.06
149 13.2 11.333333 92.10
151 17.5 15.266667 89.03
152 20.6 22.233333 86.26
153 19.2 16.333333 88.04
154 14.0 10.800000 86.72
156 12.8 4.700000 90.78
157 11.2 3.933333 95.22
160 30.4 11.300000 88.85
168 47.6 19.800000 83.43
181 18.4 8.566667 97.35
219 51.7 NaN 87.14
220 17.2 15.700000 89.60
221 18.9 21.766667 89.58
222 23.9 16.100000 86.89
223 25.7 22.200000 84.29
224 20.0 15.266667 85.76
226 16.7 12.700000 85.90
241 9.6 12.433333 87.04
Digital skills above basic level \
20 6.881001
21 7.739488
22 7.979152
23 7.651486
24 8.072771
25 8.007238
73 38.182303
138 29.632703
139 29.632703
140 29.632703
142 29.632703
144 31.658243
145 30.201776
149 21.853206
151 21.124765
152 20.467508
153 20.889861
154 20.576656
156 38.597303
157 40.485076
160 22.611617
168 21.232270
181 NaN
219 29.571648
220 8.868571
221 8.866592
222 8.600337
223 8.342990
224 8.488490
226 8.502347
241 20.160605
Online interaction with public authorities Internet access \
20 17.68 71.324158
21 22.13 74.007414
22 25.93 75.417230
23 18.54 71.497558
24 36.30 84.331716
25 25.16 70.040531
73 94.35 99.265970
138 74.04 NaN
139 74.87 NaN
140 78.02 NaN
142 74.17 NaN
144 47.40 91.214831
145 45.05 97.123682
149 73.02 87.876316
151 66.93 90.464120
152 64.66 86.238250
153 63.69 91.535815
154 69.45 93.522051
156 91.00 88.381400
157 90.89 94.219075
160 35.53 95.279201
168 27.10 94.900901
181 78.20 98.421072
219 43.79 90.920181
220 12.80 87.431754
221 16.88 82.718279
222 12.05 63.628775
223 12.94 67.054536
224 10.46 62.334884
226 13.64 74.298647
241 54.14 82.243411
Freedom of media Subjective health status \
20 44.017563 64.173979
21 31.580732 64.173979
22 36.038126 64.173979
23 31.750105 64.173979
24 28.854629 70.663741
25 34.495191 70.663741
73 89.739823 66.979339
138 NaN NaN
139 NaN NaN
140 NaN NaN
142 NaN NaN
144 48.349965 62.767847
145 72.226331 62.767847
149 46.029355 65.594238
151 43.696804 65.594238
152 53.355036 61.617039
153 39.128852 61.617039
154 47.950009 61.617039
156 82.383342 81.202145
157 82.135800 81.202145
160 70.219241 76.117435
168 68.108595 71.736228
181 59.657466 76.452805
219 83.878809 46.624557
220 63.211857 74.332222
221 60.370669 74.332222
222 68.098965 70.020375
223 63.355079 70.020375
224 61.167339 72.555964
226 70.207752 74.884357
241 80.322566 65.252924
Standardised cancer death rate Standardised heart disease death rate \
20 111.71 177.01
21 107.45 161.38
22 103.87 146.30
23 93.60 179.74
24 86.49 158.98
25 100.09 157.09
73 70.27 25.54
138 56.47 25.77
139 66.75 30.84
140 56.24 46.99
142 67.70 68.42
144 99.29 54.76
145 105.62 71.97
149 140.73 106.28
151 143.70 98.91
152 147.60 129.75
153 142.37 118.43
154 137.49 115.55
156 66.51 33.29
157 65.78 29.42
160 64.99 22.22
168 65.88 32.01
181 59.83 25.77
219 88.70 46.17
220 116.66 123.41
221 110.47 107.58
222 118.95 111.04
223 123.76 113.24
224 118.27 118.95
226 106.41 116.69
241 95.31 85.59
Years of life lost due to air pollution Index of positive emotions \
20 1595.331861 69.662316
21 1403.075472 60.985101
22 1400.292431 59.174885
23 1258.053269 56.982241
24 1435.164438 69.834075
25 1550.893974 62.956455
73 231.271641 79.335212
138 NaN NaN
139 NaN NaN
140 NaN NaN
142 NaN NaN
144 625.269271 69.483735
145 1215.616194 69.834275
149 982.421809 63.924973
151 1018.261308 68.514176
152 1450.165143 62.924879
153 1199.113915 64.627739
154 1032.157366 59.852810
156 132.990230 77.708654
157 108.084469 75.688084
160 387.721102 67.905975
168 531.279547 67.172942
181 140.060215 70.450629
219 NaN 73.135644
220 1022.404058 58.620701
221 934.829901 65.335070
222 1153.484740 68.277262
223 788.980873 69.396496
224 1142.595567 63.121105
226 1430.731088 63.540169
241 1248.576541 72.737662
Air pollution NO2 Air pollution Ozone (SOMO35) Air pollution pm2.5 \
20 14.358467 3034.710975 16.199397
21 16.074244 3101.807327 14.803620
22 15.515333 3242.947617 14.763625
23 15.779245 3112.482382 13.760083
24 21.384778 2691.976191 15.048761
25 17.253786 3371.289643 15.903341
73 9.225796 2542.431365 8.136573
138 NaN NaN NaN
139 NaN NaN NaN
140 NaN NaN NaN
142 NaN NaN NaN
144 10.537707 6110.991568 11.423375
145 18.700000 5134.266667 17.700000
149 12.349501 4566.467173 12.895554
151 11.035812 4623.226746 13.225339
152 12.703328 4200.720981 16.813405
153 14.119673 3973.778091 14.694741
154 13.970837 4321.338619 13.346143
156 6.476355 2401.300376 7.364581
157 10.894495 1576.778311 6.955132
160 15.270986 6607.424456 9.824047
168 11.568476 6339.201472 11.679124
181 14.000000 3486.166667 7.400000
219 NaN NaN NaN
220 17.183943 3119.923300 13.547579
221 17.491909 2736.542365 12.781361
222 16.701928 2686.237008 14.648731
223 17.708084 3049.873969 11.505344
224 17.101476 3206.913901 14.568413
226 15.526070 3521.443159 17.119240
241 11.888746 3743.048134 17.288746
Bathing water quality Trust in the national government \
20 NaN 26.260028
21 NaN 15.006224
22 0.767442 6.466323
23 1.000000 20.480822
24 NaN 25.659458
25 NaN 18.695278
73 0.891156 64.768757
138 0.703125 NaN
139 0.645161 NaN
140 0.090909 NaN
142 0.285714 NaN
144 0.984496 28.180843
145 0.052632 40.392784
149 0.739130 46.168330
151 0.720000 41.732612
152 0.357143 44.408648
153 0.384615 33.667501
154 0.280000 42.039402
156 0.896552 67.208976
157 0.677419 62.531003
160 0.856448 38.906744
168 0.781609 38.345288
181 0.823529 35.446173
219 0.827586 53.788383
220 NaN 15.084238
221 NaN 17.013996
222 NaN 16.551924
223 0.840000 25.826363
224 NaN 19.690609
226 NaN 21.236413
241 0.600000 25.835636
Trust in the legal system Trust in the police \
20 24.777231 63.754444
21 26.045265 62.817532
22 17.230189 50.691063
23 28.126740 72.332635
24 9.929840 46.140749
25 25.092784 69.510244
73 87.825540 84.103289
138 NaN NaN
139 NaN NaN
140 NaN NaN
142 NaN NaN
144 24.530729 80.213798
145 35.542155 74.470215
149 43.320446 77.247959
151 52.780813 78.328057
152 48.043374 64.555192
153 48.620709 68.640397
154 41.307942 73.266985
156 69.660096 77.392460
157 66.247598 83.366891
160 48.991338 77.473463
168 46.867920 77.486812
181 48.133744 70.158365
219 49.710758 78.591264
220 37.831775 60.391196
221 38.308302 64.139650
222 42.362445 59.209724
223 44.351449 71.066672
224 38.591906 62.392230
226 40.821090 71.391120
241 57.312993 81.962150
Voiced opinion to public official \
20 17.232252
21 15.617317
22 12.249868
23 23.930437
24 20.637907
25 12.206946
73 41.638605
138 NaN
139 NaN
140 NaN
142 NaN
144 16.658870
145 18.352706
149 20.782973
151 15.948091
152 33.549440
153 18.170044
154 30.945189
156 26.684969
157 25.694373
160 20.726316
168 16.155737
181 12.229954
219 36.653251
220 20.346496
221 17.183250
222 20.643253
223 28.289891
224 24.953060
226 23.679496
241 22.007959
Female participation in regional assemblies Institution quality index \
20 23.750000 -2.703
21 23.750000 -1.392
22 23.750000 -1.406
23 23.750000 -1.712
24 23.750000 -2.160
25 23.750000 -2.694
73 47.619048 1.781
138 48.780488 -1.204
139 45.098039 -0.839
140 40.000000 -1.508
142 50.000000 -1.968
144 29.927007 -0.789
145 39.583333 -1.240
149 16.981132 -0.874
151 10.416667 -1.096
152 11.864407 -1.147
153 14.705882 -1.095
154 14.754098 -1.018
156 27.802691 1.208
157 27.802691 1.032
160 19.354839 0.041
168 24.285714 -2.116
181 35.000000 1.235
219 29.787234 0.429
220 21.319797 -1.041
221 18.000000 -1.238
222 19.718310 -1.620
223 20.975610 -1.729
224 21.459227 -1.755
226 16.265060 -1.302
241 10.743802 -0.811
Freedom over life choices Job opportunities Teenage pregnancy \
20 79.932867 16.523476 47.283763
21 64.996804 14.124578 28.625954
22 80.641157 19.366971 25.999151
23 73.586025 22.753249 63.776133
24 70.934462 30.302663 21.334576
25 61.620352 35.250503 42.887199
73 92.412159 75.818968 1.086778
138 NaN NaN 11.798980
139 NaN NaN 12.450852
140 NaN NaN 69.213612
142 NaN NaN 71.783842
144 62.843158 36.848069 3.040202
145 62.089196 58.329276 3.131277
149 79.881765 53.005955 13.340134
151 84.585888 37.816191 19.900971
152 71.567585 37.858879 43.128906
153 72.606966 37.829059 31.915989
154 71.640371 36.610524 13.689940
156 89.788114 61.826880 4.283756
157 87.391734 56.608397 5.052588
160 68.964543 32.359360 2.258866
168 75.993345 18.331588 7.728705
181 82.573643 44.551035 2.831613
219 85.051266 57.727387 4.492007
220 86.472665 48.446216 33.829745
221 81.479900 45.584064 46.593688
222 91.418181 21.737960 29.886090
223 79.014832 34.930716 37.524013
224 80.123854 31.940370 40.317092
226 82.534399 30.514953 34.315044
241 69.956341 11.860318 49.449936
Young people not in education, employment or training (NEET) \
20 25.466667
21 13.733333
22 13.733333
23 19.400000
24 7.166667
25 14.666667
73 6.300000
138 18.400000
139 18.033333
140 28.333333
142 20.800000
144 12.833333
145 6.833333
149 8.466667
151 12.666667
152 17.300000
153 16.700000
154 9.533333
156 8.533333
157 8.766667
160 15.366667
168 28.633333
181 7.433333
219 12.500000
220 12.200000
221 24.100000
222 13.466667
223 21.500000
224 18.333333
226 20.433333
241 16.100000
Institutions corruption index (EQI) \
20 -2.563
21 -0.110
22 -1.049
23 -1.153
24 -1.806
25 -1.481
73 1.513
138 -1.174
139 -0.565
140 -0.894
142 -1.511
144 -0.798
145 -1.189
149 -1.055
151 -1.209
152 -1.492
153 -1.587
154 -1.368
156 1.258
157 0.641
160 0.178
168 -1.320
181 1.199
219 -0.267
220 -1.033
221 -0.552
222 -1.390
223 -1.343
224 -1.600
226 -1.091
241 -1.114
Institution impartiality index (EQI) Tolerance towards immigrants \
20 -1.648 26.633780
21 -0.100 28.055398
22 -0.325 24.849501
23 -0.948 34.070780
24 -1.541 34.270100
25 -2.293 49.345868
73 1.157 88.113235
138 -0.381 NaN
139 -0.043 NaN
140 -0.347 NaN
142 0.130 NaN
144 -0.855 39.516334
145 -1.424 56.059397
149 -0.624 40.264304
151 -1.071 43.036625
152 -1.009 36.170510
153 -1.497 33.861097
154 -1.019 36.533368
156 0.893 85.557875
157 0.948 85.420738
160 0.014 74.070436
168 -2.452 75.650138
181 1.296 63.277318
219 0.523 91.311816
220 -0.841 60.722968
221 -0.537 58.441035
222 -1.085 45.662320
223 -1.218 48.234844
224 -0.743 47.866458
226 -0.989 50.827081
241 -1.146 49.119725
Tolerance towards minorities Tolerance towards gay or lesbian people \
20 60.040456 21.227692
21 59.526223 13.980755
22 57.038981 15.248122
23 50.858231 11.725824
24 53.966053 23.253465
25 74.447695 39.809715
73 85.831922 90.188596
138 NaN NaN
139 NaN NaN
140 NaN NaN
142 NaN NaN
144 52.916913 36.052889
145 68.049388 55.800389
149 63.171081 47.773431
151 74.218147 47.364547
152 71.549399 30.769921
153 62.175433 33.227596
154 66.929844 33.853538
156 85.770663 78.829749
157 89.092996 77.075454
160 86.405279 73.783663
168 82.474538 70.190504
181 72.473454 42.838908
219 87.293176 71.365073
220 77.058538 21.246246
221 76.111439 21.735440
222 53.359268 9.609959
223 64.495078 14.518233
224 64.818275 23.839380
226 66.376230 9.167170
241 59.991146 30.219435
Women treated with respect Tertiary education attainment \
20 71.211593 18.7
21 65.648795 26.4
22 47.779465 28.1
23 65.425224 21.7
24 64.714521 43.5
25 73.040689 22.9
73 80.332996 53.1
138 NaN 24.2
139 NaN 29.3
140 NaN 22.0
142 NaN 24.8
144 59.638205 24.9
145 55.304711 43.8
149 69.291904 22.7
151 67.394845 22.0
152 61.264279 19.9
153 56.124928 20.6
154 70.282286 22.8
156 86.626445 50.2
157 81.755746 56.9
160 59.384062 22.3
168 47.124230 15.2
181 80.405940 52.3
219 59.646254 22.3
220 46.496009 18.2
221 45.435139 18.9
222 34.534991 14.0
223 45.236989 14.0
224 36.360040 13.7
226 36.397542 17.6
241 81.474170 28.2
Lifelong learning Academic citations per 1000 persons \
20 1.0 0.073352
21 1.7 0.261241
22 0.9 0.367248
23 1.9 0.248473
24 2.4 1.360754
25 1.4 0.366007
73 31.2 21.182476
138 6.0 0.326001
139 9.6 0.196017
140 6.6 0.598639
142 8.0 NaN
144 4.5 NaN
145 7.2 1.493249
149 7.0 0.714710
151 10.6 1.214508
152 8.5 0.358730
153 8.2 1.380873
154 8.3 2.154984
156 10.9 4.972473
157 12.6 6.894654
160 11.4 5.340761
168 6.3 3.587440
181 18.1 5.992004
219 9.0 1.898200
220 7.7 1.719081
221 2.5 0.604914
222 7.2 0.996194
223 6.9 0.257528
224 6.1 0.105659
226 3.0 0.403420
241 10.6 1.376123
gdp_per_capita_2022
20 8500.0
21 8900.0
22 10300.0
23 11900.0
24 20800.0
25 9300.0
73 90400.0
138 25300.0
139 27000.0
140 15600.0
142 11500.0
144 16800.0
145 31900.0
149 16200.0
151 12000.0
152 11400.0
153 11400.0
154 12700.0
156 120300.0
157 104100.0
160 35700.0
168 20100.0
181 118700.0
219 23700.0
220 13800.0
221 14100.0
222 9100.0
223 11900.0
224 11300.0
226 11300.0
241 14600.0
Outliers per variable based on the IQR
[Figure 00.5]
num_cols = df.select_dtypes(include=['float64', 'int64']).columns
rows = []
for col in num_cols:
    # Tukey fences: values more than 1.5 * IQR beyond the quartiles count as outliers
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
    num_outliers = outliers.shape[0]
    perc_outliers = round((num_outliers / df.shape[0]) * 100, 2)
    rows.append({'Variable': col, 'Outliers': num_outliers, '% Outliers': perc_outliers})
# Build the summary once instead of concatenating onto an empty DataFrame inside the loop
outliers_summary = pd.DataFrame(rows, columns=['Variable', 'Outliers', '% Outliers'])
# Keep only the variables that have at least one outlier
outliers_summary = outliers_summary[outliers_summary['Outliers'] > 0]
print("Figure 00.5")
print()
print(outliers_summary)
Figure 00.5
Variable Outliers % Outliers
0 Infant mortality 15 6.20
1 Satisfaction with water quality 3 1.24
2 Uncollected sewage 59 24.38
3 Sewage treatment, additional 21 8.68
4 Safety at night 5 2.07
7 Traffic deaths 11 4.55
8 Share of low-achieving 15 year olds in reading... 60 24.79
9 Share of low-achieving 15 year olds in maths a... 27 11.16
10 Lower-secondary completion only 9 3.72
11 Early school leavers 6 2.48
12 Broadband at home 5 2.07
13 Digital skills above basic level 10 4.13
15 Internet access 20 8.26
16 Freedom of media 6 2.48
17 Subjective health status 8 3.31
18 Standardised cancer death rate 12 4.96
19 Standardised heart disease death rate 23 9.50
20 Years of life lost due to air pollution 4 1.65
21 Index of positive emotions 2 0.83
22 Air pollution NO2 8 3.31
23 Air pollution Ozone (SOMO35) 7 2.89
24 Air pollution pm2.5 4 1.65
25 Bathing water quality 17 7.02
28 Trust in the police 24 9.92
29 Voiced opinion to public official 1 0.41
32 Freedom over life choices 8 3.31
34 Teenage pregnancy 25 10.33
35 Young people not in education, employment or t... 10 4.13
38 Tolerance towards immigrants 13 5.37
39 Tolerance towards minorities 2 0.83
40 Tolerance towards gay or lesbian people 2 0.83
41 Women treated with respect 3 1.24
43 Lifelong learning 19 7.85
44 Academic citations per 1000 persons 7 2.89
45 gdp_per_capita_2022 4 1.65
Outliers per country
[Figure 00.6]
country_column = 'Country'
countries = df[country_column].unique().tolist()
rows = []
for col in num_cols:
    outliers_data = {'Variable': col}
    for country in countries:
        # Fences are computed within each country, so a region is compared only with its national peers
        df_country = df[df[country_column] == country]
        Q1 = df_country[col].quantile(0.25)
        Q3 = df_country[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        outliers = df_country[(df_country[col] < lower_bound) | (df_country[col] > upper_bound)]
        outliers_data[country] = outliers.shape[0]
    rows.append(outliers_data)
# One row per variable, one column per country
outliers_summary = pd.DataFrame(rows, columns=['Variable'] + countries)
styled_outliers_summary = outliers_summary.style.background_gradient(cmap='Reds', subset=pd.IndexSlice[:, outliers_summary.columns != 'Variable'])
print("Figure 00.6")
print()
styled_outliers_summary
Figure 00.6
| | Variable | AT | BE | BG | CY | CZ | DE | DK | EE | EL | ES | FI | FR | HR | HU | IE | IT | LT | LU | LV | MT | NL | PL | PT | RO | SE | SI | SK |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Infant mortality | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 4 | 2 | 5 | 1 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| 1 | Satisfaction with water quality | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | Uncollected sewage | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 1 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | Sewage treatment, additional | 0 | 1 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | Safety at night | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 5 | Money Stolen | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 3 | 0 | 1 | 7 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 |
| 6 | Assaulted/Mugged | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 7 | Traffic deaths | 0 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 2 | 0 | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 8 | Share of low-achieving 15 year olds in reading (level 1a or lower) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | Share of low-achieving 15 year olds in maths and science | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 10 | Lower-secondary completion only | 1 | 0 | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 2 | 0 | 1 | 0 | 0 | 0 |
| 11 | Early school leavers | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 12 | Broadband at home | 0 | 0 | 1 | 0 | 1 | 3 | 0 | 0 | 1 | 0 | 0 | 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 1 | 0 | 0 |
| 13 | Digital skills above basic level | 2 | 0 | 1 | 0 | 1 | 3 | 0 | 0 | 1 | 0 | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 1 | 0 | 0 |
| 14 | Online interaction with public authorities | 1 | 0 | 1 | 0 | 2 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
| 15 | Internet access | 0 | 0 | 1 | 0 | 1 | 5 | 2 | 0 | 0 | 4 | 0 | 2 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 2 | 0 | 1 |
| 16 | Freedom of media | 0 | 0 | 1 | 0 | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 17 | Subjective health status | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 18 | Standardised cancer death rate | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 2 | 0 | 0 |
| 19 | Standardised heart disease death rate | 0 | 0 | 0 | 0 | 1 | 3 | 1 | 0 | 4 | 0 | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 20 | Years of life lost due to air pollution | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 21 | Index of positive emotions | 2 | 0 | 0 | 0 | 1 | 15 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 22 | Air pollution NO2 | 1 | 2 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 2 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 2 | 0 | 0 | 0 |
| 23 | Air pollution Ozone (SOMO35) | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 24 | Air pollution pm2.5 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 25 | Bathing water quality | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 26 | Trust in the national government | 2 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 27 | Trust in the legal system | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 5 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7 | 0 | 0 | 2 | 0 | 1 |
| 28 | Trust in the police | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 2 | 0 | 0 |
| 29 | Voiced opinion to public official | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 30 | Female participation in regional assemblies | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
| 31 | Institution quality index | 0 | 0 | 0 | 0 | 2 | 1 | 2 | 0 | 0 | 0 | 0 | 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 32 | Freedom over life choices | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 2 | 0 | 0 | 0 | 0 | 0 |
| 33 | Job opportunities | 2 | 0 | 0 | 0 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 34 | Teenage pregnancy | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 35 | Young people not in education, employment or training (NEET) | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 36 | Institutions corruption index (EQI) | 0 | 1 | 0 | 0 | 1 | 1 | 2 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 37 | Institution impartiality index (EQI) | 0 | 1 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 38 | Tolerance towards immigrants | 0 | 0 | 1 | 0 | 1 | 0 | 2 | 0 | 3 | 0 | 1 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 |
| 39 | Tolerance towards minorities | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 |
| 40 | Tolerance towards gay or lesbian people | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 41 | Women treated with respect | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 42 | Tertiary education attainment | 1 | 0 | 1 | 0 | 2 | 1 | 1 | 0 | 2 | 0 | 1 | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 |
| 43 | Lifelong learning | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 2 | 3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 44 | Academic citations per 1000 persons | 0 | 0 | 2 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 2 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| 45 | gdp_per_capita_2022 | 0 | 0 | 1 | 0 | 1 | 2 | 1 | 0 | 1 | 0 | 1 | 3 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 |
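The per-country counts above come from applying the 1.5 × IQR fences separately inside each country. A minimal sketch of the same idea with `groupby`, on hypothetical toy data (the country codes `AA` and `BB` and the `Indicator` column are invented for illustration), shows why a country with a single region can never report an outlier under this rule: its quartiles coincide, so the IQR is zero and the only value sits exactly on both fences.

```python
import pandas as pd

# Hypothetical toy data: "AA" has one extreme region, "BB" has a single region
toy = pd.DataFrame({
    "Country": ["AA"] * 6 + ["BB"],
    "Indicator": [10.0, 11.0, 12.0, 13.0, 14.0, 100.0, 50.0],
})

# Quartiles per country, mapped back onto each region's row
grouped = toy.groupby("Country")["Indicator"]
q1 = toy["Country"].map(grouped.quantile(0.25))
q3 = toy["Country"].map(grouped.quantile(0.75))
iqr = q3 - q1

# A region is flagged when it falls outside its own country's fences
is_outlier = (toy["Indicator"] < q1 - 1.5 * iqr) | (toy["Indicator"] > q3 + 1.5 * iqr)

# Number of flagged regions per country
outliers_per_country = is_outlier.groupby(toy["Country"]).sum()
print(outliers_per_country)
```

This may explain the all-zero columns for countries that consist of one region in this dataset, although that reading is an assumption about the data rather than something the notebook states.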
[Figure 00.7]
num_cols = df.select_dtypes(include=['float64', 'int64']).columns
country_col = 'Country'
print("Figure 00.7")
for col in num_cols:
    plt.figure(figsize=(12, 6))
    sns.boxplot(x=country_col, y=col, data=df, color='#9ecae1')
    # Dashed red line: mean of the indicator across all regions
    mean = df[col].mean()
    plt.axhline(mean, color='red', linestyle='--', linewidth=1, label=f'Mean ({mean:.2f})')
    plt.title(col)
    plt.legend()  # without this call the label of the mean line is never shown
    plt.grid(True)
    plt.show()
Figure 00.7
print(outliers_df.Country.nunique())
outliers_df.RegionName.nunique()
11
31
With the Z-score method, outliers appear in 31 regions across 11 countries. It is reasonable to expect a few regions per country to have values far from the mean, for geographic or social reasons.
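The cell that builds `outliers_df` is not shown in this part of the notebook. As a rough sketch of how a Z-score filter of this kind is commonly written, the example below standardizes each numeric column and keeps the rows where any indicator lies more than three standard deviations from its mean. The threshold of 3, the use of the population standard deviation, and the toy data are all assumptions, not the notebook's actual settings.

```python
import pandas as pd

# Hypothetical toy data; the real notebook works on the regional indicator table `df`
toy = pd.DataFrame({
    "RegionName": [f"R{i}" for i in range(1, 13)],
    "Country": ["AA"] * 6 + ["BB"] * 6,
    "Indicator": [10.0, 11.0, 9.0, 10.5, 9.5, 10.0,
                  10.2, 9.8, 10.1, 9.9, 10.0, 40.0],
})

THRESHOLD = 3.0  # assumed cutoff; the value used in the notebook is not shown

num = toy.select_dtypes(include="number")
# Z-score: distance from the column mean in units of the column standard deviation
z_scores = (num - num.mean()) / num.std(ddof=0)
# Keep regions where at least one indicator exceeds the threshold in absolute value
outliers_toy = toy[(z_scores.abs() > THRESHOLD).any(axis=1)]

print(outliers_toy["Country"].nunique())
print(outliers_toy["RegionName"].nunique())
```

Unlike the per-country IQR fences, this version standardizes against all regions at once, so it tends to flag regions that are extreme at the European level rather than within their own country.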
On the map, most outliers are concentrated in a few specific areas:
- Eastern Europe: mainly Bulgaria, Romania, and adjacent regions.
- Ireland
- Overseas territories of France and Portugal
Apart from these three areas, the remaining outliers are scattered with no apparent relationship between them, at least geographically.
merged_gdf = pd.merge(gdf, outliers_df, how='right', left_on='NUTS_ID', right_on='NUTS code')
[Figure 00.8]
print("Figure 00.8")
fig, ax = plt.subplots(figsize=(12, 10))
ax.set_facecolor('lightgrey')
gdf.plot(ax=ax, color='#e5f5f9', edgecolor='gray', linewidth=0.5) # Base map
merged_gdf.plot(ax=ax, color='red', markersize=50, label='Outliers') # Outliers in red
ax.set_title('Outliers')
plt.show()
Figure 00.8
Distribution of the indicators
[Figure 00.9]
print("Figure 00.9")
num_cols = df.select_dtypes(include=['number']).columns
fig, axs = plt.subplots(ncols=3, nrows=16, figsize=(20, 60))
axs = axs.flatten()
# One histogram with a KDE overlay per indicator; histplot replaces the deprecated distplot
for ax, col in zip(axs, num_cols):
    sns.histplot(df[col], bins=20, kde=True, stat='density', ax=ax)
plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=5.0)
Figure 00.9
Correlation between indicators
[Figure 00.10]
print("Figure 00.10")
num_col = df.select_dtypes(include='number')
# Pairwise Pearson correlation (the pandas default) between all numeric indicators
correlation_matrix = num_col.corr()
plt.figure(figsize=(12, 12))
sns.heatmap(correlation_matrix, annot=False, cmap='coolwarm', vmin=-1, vmax=1)
plt.show()
Figure 00.10
Save the dataset
df.to_csv('00_lectura_datos.csv', index=False)